Protein Fingerprinting
Molecular Dynamics Group

The direct correlation between the intensity of the 2D IR features and the number of amino acids, demonstrated in the peptides section, shows that peptides can be fingerprinted with the DOVE-FWM form of 2D IR spectroscopy (called EVV 2DIR spectroscopy) when methylene groups are used as an internal reference. The linear behaviour observed supports the idea of using this technique for protein fingerprinting. Using established relationships such as those shown in the results of the peptides section, a quantity ratio can be determined from a measured intensity ratio. Ultimately, the strategy will be to use a database of amino acid quantity ratios computed from known proteins. For a given unidentified protein, the intensity ratios will be measured for a number of identified peaks, each corresponding to a specific known amino acid. The corresponding quantity ratios will then be compared with the database of protein ratios to identify the protein.

Our recent results, published in Proceedings of the National Academy of Sciences (PNAS article) and presented below, demonstrate that proteins can be differentiated by their relative amino acid contents measured by EVV 2DIR spectroscopy.

A bioinformatics analysis using a database of amino acid/CH3 ratios for 33000 human proteins shows that identifying the relative levels of only five amino acids would allow around 44% of the proteins in the ENSEMBL human protein database to be uniquely identified and around 72% to be narrowed down to at most two candidate proteins.

 

Bioinformatics

The following histograms demonstrate the feasibility of identifying proteins from their amino acid/CH3 ratios. Tests were performed on our human proteome database: for each protein in the database, its amino acid/CH3 ratios and their precisions were used as search parameters.

The horizontal axes correspond to the number of proteins returned by the database when a search was performed for a protein. The vertical axes represent the frequency with which a particular number of hits was returned when the search was performed for each of the 33000 proteins in the database in turn. (a) Number of hits returned when three amino acid/CH3 ratios (Tyr/CH3, Trp/CH3, and Phe/CH3) were input with 10% precision for each protein. (b) Number of hits returned when the search was extended to five amino acid ratios (by also using the His/CH3 and Cys/CH3 ratios). (Insets) Results when the molecular weight (with 10% precision) was also included as a search parameter. The first bar of (b) shows that around 15000 of the 33000 proteins in the database (around 44%) gave only one hit and thus were uniquely identifiable. The second bar shows that 9000 proteins gave two hits and so could be one of only two possible database candidates. When the molecular weight was also used as a parameter, around 20000 (60%) of the proteins were unambiguously identified.
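The search behind these histograms can be illustrated with a minimal sketch. It assumes that "10% precision" means each ratio of a candidate must lie within ±10% of the query ratio; the matching rule, the function names and the toy database below are hypothetical and are not taken from our analysis code.

```python
from collections import Counter

def matches(query, candidate, precision=0.10):
    """True if every candidate ratio lies within +/- precision of the query ratio.

    The relative-window rule is an assumption made for this illustration.
    """
    return all(abs(c - q) <= precision * q for q, c in zip(query, candidate))

def hit_histogram(database, precision=0.10):
    """Search the database with each protein in turn.

    Returns a Counter mapping 'number of hits' to 'how many proteins gave that many hits'.
    """
    hits = [
        sum(matches(query, candidate, precision) for candidate in database.values())
        for query in database.values()
    ]
    return Counter(hits)

# Hypothetical toy database of (Tyr/CH3, Trp/CH3, Phe/CH3) ratios, invented for illustration.
toy_db = {
    "P1": (0.100, 0.020, 0.080),
    "P2": (0.105, 0.021, 0.082),
    "P3": (0.300, 0.050, 0.010),
    "P4": (0.020, 0.100, 0.200),
}

histogram = hit_histogram(toy_db)
for n_hits in sorted(histogram):
    print(n_hits, "hit(s):", histogram[n_hits], "protein(s)")
```

In this toy set, P3 and P4 each return a single hit (themselves) and would be uniquely identifiable, while P1 and P2 return each other as well. Adding further ratios or the molecular weight only requires extending the tuples.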

 

Amino acid cross-peak identification

The first step of our protein differentiation/identification procedure is to identify 2DIR features specific to amino acids in proteins. Our previous work on peptides serves as a guide for assigning amino acid cross-peaks in protein EVV 2DIR spectra. The following spectra show the amino acid cross-peaks identified: phenylalanine, tyrosine and tryptophan.

EVV 2DIR spectra of pepsin (left hand side) measured with two different polarization combinations: PPP (all beams having their fields in the plane of propagation) and PPS (IR beams polarized in the plane of propagation and the visible beam polarized normal to the IR beams). The spectra were measured for the same set of pulse delays, T12 = 2 ps and T23 = 1 ps, and are plotted on the same intensity scale. The cross-peaks used in this study were mainly identified from previous studies of peptides. The comparison of the PPS spectra of pepsin and alpha-chymotrypsin (a protein rich in tryptophan) shows a clearly visible tryptophan cross-peak.

The intensities of four amino acid cross-peaks, corresponding to three amino acid residues, were monitored for a set of 10 different known proteins to perform the relative quantification procedure: the phenylalanine "Phe 1" and tyrosine "Tyr" cross-peaks in the PPP polarization scheme, and the phenylalanine "Phe 2(S)" and tryptophan "Trp" cross-peaks in the PPS polarization scheme. The "Phe 2(P)" feature is considered congested and is therefore not exploited. The internal reference "CH3" is measured for both polarization schemes.

 

Amino acid quantification

The measured ratios of amino acid peak intensity to internal reference intensity, plotted against the known ratios for the 10 proteins and the four identified amino acid peaks, vary linearly, as in our previous work with peptides.

Each data point comes from one of the 10 protein species analyzed. The solid lines are linear fits constrained through the origin, and the error bars are standard deviations from four repeat measurements. The average precisions on the amino acid/CH3 ratios were deduced from the average experimental error bars and are denoted "Precision" on the graphs. The horizontal dispersions of the data points relative to the linear fits are the average absolute differences and are denoted "Dispersion". This is essentially the standard deviation due to variation of the individual protein points from the linear fit. For tryptophan, where the dispersion is greater than the precision, the implication is that some residual structural effect influences the cross-peak intensity.
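A fit constrained through the origin has a closed-form least-squares slope, sum(x*y) / sum(x*x), and inverting it converts a measured intensity ratio into an expected quantity ratio. The sketch below illustrates this. The function names and numbers are invented for the example, and computing the dispersion as a mean absolute horizontal distance is our reading of the definition above, not a transcription of the analysis code.

```python
def fit_through_origin(known, measured):
    """Least-squares slope of measured = slope * known, with the intercept fixed at zero."""
    return sum(x * y for x, y in zip(known, measured)) / sum(x * x for x in known)

def expected_ratio(measured_ratio, slope):
    """Invert the fit: convert a measured intensity ratio into an amino acid/CH3 quantity ratio."""
    return measured_ratio / slope

def horizontal_dispersion(known, measured, slope):
    """Mean absolute horizontal distance between the data points and the fitted line."""
    return sum(abs(y / slope - x) for x, y in zip(known, measured)) / len(known)

# Invented numbers for illustration only (not measured data).
known = [0.05, 0.10, 0.20, 0.40]      # known amino acid/CH3 quantity ratios
measured = [0.11, 0.19, 0.41, 0.79]   # measured peak/CH3 intensity ratios

slope = fit_through_origin(known, measured)
print("slope:", slope)
print("expected ratio for a measured 0.30:", expected_ratio(0.30, slope))
print("dispersion:", horizontal_dispersion(known, measured, slope))
```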

 

Protein differentiation procedure

The differentiation procedure relies on the calculation of overlap integrals. For a protein Px, and a peak corresponding to an amino acid AAn, the distribution of expected ratios is described as follows:

where RcAAn is the average expected ratio for amino acid n, deduced from the measured ratio using the linear fit, and SDAAn is the average reproducibility standard deviation for amino acid n.

The overlap integral used to compare the proteins Px and Py, taking into account n peaks, is given by:

This overlap integral is normalized relative to the diagonal value corresponding to the comparison of a protein with itself: IPxPx (n Peaks).
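The two expressions referred to above are not reproduced in this text. One form consistent with the description (a normal distribution for each expected ratio, and an overlap normalized to the self-comparison) is sketched below. It is an assumption, in particular the product over independent peaks and the symbol D for the distribution, and it should be checked against the PNAS article.

```latex
% Assumed form: the source describes these expressions only in words.
% Distribution of expected ratios for protein P_x and amino acid AA_n:
D_{P_x}^{AA_n}(R) = \frac{1}{SD_{AA_n}\sqrt{2\pi}}
  \exp\!\left[-\frac{\left(R - Rc_{AA_n}^{P_x}\right)^{2}}{2\,SD_{AA_n}^{2}}\right]

% Overlap integral over n peaks (peaks treated as independent):
I_{P_x P_y}(n\ \text{peaks}) = \prod_{i=1}^{n} \int_{-\infty}^{\infty}
  D_{P_x}^{AA_i}(R)\, D_{P_y}^{AA_i}(R)\, \mathrm{d}R

% Normalized to the diagonal value, for equal widths on both proteins:
\frac{I_{P_x P_y}(n\ \text{peaks})}{I_{P_x P_x}(n\ \text{peaks})}
  = \prod_{i=1}^{n}
  \exp\!\left[-\frac{\left(Rc_{AA_i}^{P_x} - Rc_{AA_i}^{P_y}\right)^{2}}{4\,SD_{AA_i}^{2}}\right]
```

With this form the normalized integral equals 1 when all expected ratios coincide and tends to 0 as any one of them separates by several standard deviations.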

The following graph illustrates the differentiation procedure: normal distributions representing the expected amino acid ratios are deduced from the variation of the measured ratios as a function of the known ratios. The width of each Gaussian reflects the error bar of the measurement.

When the distributions fully overlap (the case of amino acid AA1), the overlap integral is 1 and the two proteins are therefore not distinguishable. Introducing amino acid AA2 increases the distinguishability, and introducing amino acid AA3 makes the two proteins fully distinguishable.
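The AA1, AA2, AA3 progression can be reproduced numerically with a minimal sketch. It assumes Gaussians of equal width for the two proteins, for which the self-normalized overlap has the closed form exp(-(mu_x - mu_y)^2 / (4 sd^2)), and it assumes that peaks combine as a product. The function names and ratio values are hypothetical.

```python
import math

def peak_overlap(mu_x, mu_y, sd):
    """Normalized overlap of two Gaussians with means mu_x, mu_y and common width sd.

    Closed form of the overlap integral divided by the self-overlap
    (valid for equal standard deviations only).
    """
    return math.exp(-((mu_x - mu_y) ** 2) / (4.0 * sd ** 2))

def identicality(ratios_x, ratios_y, sds):
    """Product of per-peak normalized overlaps (independence of peaks is assumed)."""
    return math.prod(
        peak_overlap(x, y, sd) for x, y, sd in zip(ratios_x, ratios_y, sds)
    )

# Invented expected ratios for two proteins and three amino acids (AA1, AA2, AA3).
px = (0.10, 0.05, 0.20)
py = (0.10, 0.08, 0.35)
sds = (0.02, 0.02, 0.03)

for n in (1, 2, 3):
    print(n, "peak(s):", identicality(px[:n], py[:n], sds[:n]))
```

With these numbers the value starts at 1 for AA1 alone, because the two expected ratios are identical, and decreases as AA2 and then AA3 are included.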

 

Protein differentiation results

The results of these calculations are presented as two-dimensional maps, in which the intensity corresponds to the value of the normalized integral, from 0 (black) to 1 (white), and reflects the probability of two proteins being identical (i.e., their probability of identicality).

Tryptophan alone (Trp, PPS) helps distinguish several of the proteins in the set, whereas tyrosine alone (Tyr) differentiates very poorly. The maximum degree of differentiation is obtained when using the full set of four amino acid cross-peaks (Tyr and Phe 1, and Phe 2(S) and Trp).

The high-throughput capability of our strategy was also investigated; the results are presented in the following two figures.

Protein discernibility performance deduced from the differentiation maps. (a) Cumulative number of pairs of distinguishable proteins as a function of the tolerance on the probability of identicality for four different fingerprinting schemes (120 s acquisition time per amino acid peak): the single tyrosine PPP measurement (scheme a), tyrosine and phenylalanine PPP (scheme b), tyrosine and phenylalanine PPP plus tryptophan PPS (scheme c), and the complete set of peaks (scheme d). (b) Cumulative number of pairs of distinguishable proteins as a function of the data acquisition time per protein. Data are for a differentiation strategy using all four amino acid cross-peaks and a 10% probability of two proteins being identical (the dotted curve is a guide for the eye). For a 2 to 4 min measurement time per protein, the four cross-peak scheme with an acceptance of 10% probability of identicality distinguishes 39 pairs of proteins out of a total of 45. We estimate that, with the current technology, the signal-averaging time per amino acid peak can be shortened to between 1 and 10 s. This gives a realistic protein identification time of somewhere between 10 s and 2 min.
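The pair counts quoted in this caption follow from simple bookkeeping: 10 proteins give 10 × 9 / 2 = 45 distinct pairs, and a pair counts as distinguishable when its normalized overlap does not exceed the chosen tolerance. A minimal sketch, with a hypothetical 3 × 3 map and the "at or below the tolerance" convention taken as an assumption:

```python
from itertools import combinations

def total_pairs(n_proteins):
    """Number of distinct unordered protein pairs."""
    return n_proteins * (n_proteins - 1) // 2

def count_distinguishable(identicality_map, tolerance=0.10):
    """Count pairs whose probability of identicality is at or below the tolerance.

    identicality_map is a symmetric square matrix (list of lists) of normalized
    overlap values; only the upper triangle is inspected.
    """
    n = len(identicality_map)
    return sum(
        1 for i, j in combinations(range(n), 2)
        if identicality_map[i][j] <= tolerance
    )

# Hypothetical map for three proteins, invented for illustration.
toy_map = [
    [1.00, 0.60, 0.02],
    [0.60, 1.00, 0.08],
    [0.02, 0.08, 1.00],
]

print(total_pairs(10))                       # pairs available in a 10-protein set
print(count_distinguishable(toy_map, 0.10))  # pairs separated at 10% tolerance
print(count_distinguishable(toy_map, 0.05))  # pairs separated at 5% tolerance
```

Sweeping the tolerance from 0 to 1 and recording the count at each step yields a cumulative curve of the kind plotted in panel (a).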